Contents

1 Loading the data

For this exercise, we will use a mass cytometry (CyTOF) dataset from Bodenmiller et al. (2012), consisting of 8 paired samples (16 samples) of stimulated (BCR-XL) and unstimulated peripheral blood mononuclear cells (PBMCs) from healthy individuals. The dataset has been made available through Biocondcutor’s ExperimentHub and can be loaded into R as follows:

  1. Load the ExperimentHub package.
  2. Initialize a Hub instance eh with the ExperimentHub() function.
  3. Use query()to retrieve any records that match the keyword “Bodenmiller”.
  4. Load the “Bodenmiller_BCR_XL_flowSet” into R using eh[[<id>]].
library(ExperimentHub)
eh <- ExperimentHub()
q <- query(eh, "Bodenmiller")
fs <- eh[["EH2255"]]
## Warning: package 'S4Vectors' was built under R version 4.1.3

2 Constructing a SingleCellExperiment

Next, we want to construct an object of Bioconductor’s SingleCellExperiment (SCE) class. Herefor, we will first load the experimental panel (containing metadata on each marker) and metadata table (containing metadata on each sample).

  1. Execute the code below to load the panel and metadata tables.
  2. Give a brief description of each sample metadata variable.
# download panel & metadata 
url <- "http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow"
fns <- list(panel = "PBMC8_panel_v3.xlsx" , md = "PBMC8_metadata.xlsx")
for (fn in fns) download.file(file.path(url, fn), destfile = fn, mode = "wb") 

# load into R
library(readxl)
panel <- read_excel(fns$panel)
md <- read_excel(fns$md)
  1. Load the CATALYST package.
  2. Use prepData() to construct a SCE from the flowSet,
    panel and metadata loaded above.
  3. How many markers are there? How many cells are there? How many samples are there? How many cells per sample are there?
library(CATALYST)
sce <- prepData(fs, panel, md)

3 Clustering

CATALYST provides a simple wrapper to i) perform high resolution FlowSOM clustering; and, ii) lower resolution ConsensusClusterPlus metaclustering. Here, the data will be initially clustered into xdim x ydim groups. Secondly, the function will metacluster populations into 2 through maxK clusters.

Clusterings are stored under cluster_codes(.), and cluster IDs for a given clustering “x” can be accessed via cluster_ids(., "x") where x is a column name of cluster_codes.

  1. Cluster the data using an 8x8 SOM (high resolution), and into 2-10 metaclusters.
  2. Print a table of the number of cells from each sample (sce$sample_id) assigned to each of the 5 metaclusters (meta5).
  3. Visualize the frequencies for the 8 metaclusters (meta8) using plotAbundances();
    try both options for argument by and choose the one that is easier to interpret.
  4. Give an example of clusters that are equally, more or less frequent in one condition.
sce <- cluster(sce, xdim = 8, ydim = 8, maxK = 10, verbose = FALSE)
table(sce$sample_id, cluster_ids(sce, "meta5"))
##         
##             1    2    3    4    5
##   BCRXL1  276  750    6 1666  140
##   BCRXL2 2363 5912  161 7842  397
##   BCRXL3 2960 3883  209 4453  747
##   BCRXL4 2301 2801  155 3295  438
##   BCRXL5 1658 3124  191 3112  458
##   BCRXL6 2037 3761  147 2251  426
##   BCRXL7 2563 3871  210 7440  686
##   BCRXL8 2760 3597  169 4641  486
##   Ref1    391 1167    9  993  179
##   Ref2   3623 7596  189 4642  675
##   Ref3   2741 3106  149 2382 1056
##   Ref4   2559 2002  102 1841  402
##   Ref5   3757 4459  216 2771  759
##   Ref6   2939 5035  126 2113  825
##   Ref7   4394 4237  176 6119 1048
##   Ref8   3729 4887  149 4165  740
plotAbundances(sce, k = "meta8", by = "cluster_id")

4 Type & state markers

The methodology presented here is based on a dichotomy of markers into “type” markers (used for clustering cells into subpopulations) and “state” markers (tested for differences in expression across conditions). The set of type and state markers defined in the SCE can be viewed via type/state_markers(.).

This dichotomy is based on the assumption that type marker expressions are fairly consistent across experimental groups, i.e., stimulation does not alter cell type definitions. When breaking this assumption, there’s a risk that identified clusters don’t represent true subpopulations, but rather a group of cells whose expression has changed in one condition.

  1. Print the set of type markers pre-defined in the SCE.
  2. Using plotMultiHeatmap() with hm1 = FALSE, plot a series of heatmaps
    for the median type marker expression across the 8 metaclusters.
  3. Are there type markers that may violate the above assumption?
type_markers(sce)
##  [1] "CD3"    "CD45"   "CD4"    "CD20"   "CD33"   "CD123"  "CD14"   "IgM"   
##  [9] "HLA-DR" "CD7"
state_markers(sce)
##  [1] "pNFkB"  "pp38"   "pStat5" "pAkt"   "pStat1" "pSHP2"  "pZap70" "pStat3"
##  [9] "pSlp76" "pBtk"   "pPlcg2" "pErk"   "pLat"   "pS6"
# e.g., CD7 only expressed in BCRXL samples for most clusters
plotMultiHeatmap(sce, 
  k = "meta8",
  hm1 = FALSE, 
  hm2 = type_markers(sce))

5 Dimensionality reduction

One of the most popular ways to visualize single cell data is via dimensionality reduction (DR), where each cell is projected into a lower, usually two-dimensional, space. In this dataset, each of N cells initially is represented as a vector of M measurements (for M markers); in the dimension reduction, this \(N \times M\) matrix will be compressed into an \(N \times 2\) or \(N \times 3\) matrix for visualization, with the hope that the 3 dimensions retains most of the dominant signal in the data. Here, we will use a nonlinear dimensionality reduction technique, uniform manifold approximation and projection (UMAP).

When coloring DR plots by, e.g., sample/patient ID, cells should be spread homogeneously, while grouping of cells from the same sample/patient is indicative of batch/patient effects. When coloring by experimental condition, we expect shifts in the 2D-projection of cells when the condition has a global effect on expression, and no shift if it does not.

  1. Using runDR(), compute a UMAP embedding of at most 500 cells per sample.
  2. Using plotDR(), plot UMAP embeddings colored by patient_id and condition.
  3. Is there any apparent patient effect (i.e., do the cells from different patients differ substantially from each other)? How does BCRXL stimulation manifest itself? Hint: have a look at a particular state marker, pS6. Through visualizations, can you observe changes in pS6 signal within specific cell subpopulations (clusters) ? What are the attributes of these changed subpopulations in terms of their type markers?
sce <- runDR(sce, dr = "UMAP", cells = 500)
# no visible patient effect; cells mix more or less 
# homogeneously when colored by patient
plotDR(sce, color_by = "patient_id") 

# slight shift between cells from different treatment groups
plotDR(sce, color_by = "condition")

6 Differential discovery (optional)

This exercise is advanced and covers material that has not been discussed in lectures. If you have time and interest, you are welcome to try and answer this question.

6.1 Exploratory data analysis

pbMDS() renders a multi-dimensional scaling (MDS) plot on median expression values. Such a plot will give a sense of similarities between samples in an unsupervised way and of key difference in expression before conducting any formal testing.

  1. Generate a MDS plot with samples colored by condition and labelled by patient_id.
  2. Comment on how samples group together, and what each axis separates from one another.
# 1st dim. separates by condition, 2nd by patient
# e.g., patients 5,7,8 and 1,2,3 group together
pbMDS(sce, color_by = "condition", label_by = "patient_id")

  1. Use plotMultiHeatmap() to visualize the set of state markers across the 8 metaclusters.
  2. Give examples of markers that are unaffected, up or down regulated in the treatment group,
    and state whether these changes are rather subpopulation-specific or global.
# e.g., pNFkB up in Ref, pS6 up in BCRXL, pErk unaffected
plotMultiHeatmap(sce, 
  k = "meta8", 
  hm1 = FALSE, 
  hm2 = state_markers(sce))

6.2 Differential State (DS) analysis

DS analysis aims at identifying markers whose expression is differential across conditions, within each subpopulation. For differential testing, we will use methods implemented in the diffcyt package. The diffcyt() wrapper function returns a list containing differential testing results under rowData(.$res).

To present how accounting for the between patient variability with the mixed model increases sensitivity, we will fit both a regular linear model, as well as a linear mixed model with a random intercept for each patient.

  1. Load the diffcyt package.
  2. Fit a LMM to test for DS with and without accounting for patient effects, and using the 8 metaclusters.
  3. Print a table of the number of DS findings for each model at an FDR cutoff of 5%.
  4. How many tests are there in total and why?
  5. Comment on what effect accounting for patient variability has on sensitivity.
# load diffcyt
library(diffcyt)

# extract experimental design table
ei <- metadata(sce)$experiment_info

# create model formulas with & without patient effect
ds_formula1 <- createFormula(ei, "condition", "patient_id")
ds_formula2 <- createFormula(ei, "condition") 

# create contrast matrix
contrast <- createContrast(c(0,1))

# run diffcyt 2x
res_ds1 <- diffcyt(sce, ei, 
    formula = ds_formula1, contrast = contrast,
    analysis_type = "DS", method_DS = "diffcyt-DS-LMM",
    clustering_to_use = "meta8")

res_ds2 <- diffcyt(sce, ei, 
    formula = ds_formula2, contrast = contrast,
    analysis_type = "DS", method_DS = "diffcyt-DS-LMM",
    clustering_to_use = "meta8")

# extract results
tbl_ds1 <- rowData(res_ds1$res)
tbl_ds2 <- rowData(res_ds2$res)

# there are 8*14 = 112 tests: 
# each state marker is tested in each cluster
8*length(state_markers(sce)) == nrow(tbl_ds1)
## [1] TRUE
# accounting for patient effect increases sensitivity:
# ~90 vs. 40 cluster-marker instances called differential
fdr <- 0.05
table(tbl_ds1$p_adj < 0.05)
## 
## FALSE  TRUE 
##    27    85
table(tbl_ds2$p_adj < 0.05)
## 
## FALSE  TRUE 
##    73    39
  1. Using plotDiffHeatmap(), visualize the top 50 hits for one of the result tables above.
  2. Give one example each of a marker whose expression
    is induced/repressed by BCRXL stimulation.
  3. Plot UMAPs colored by their expression and split by condition using argument facet_by.
# induced: pS6; repressed: pBtk, pNFkB
tbl <- rowData(res_ds1$res)
plotDiffHeatmap(sce, tbl, top_n = 50)

plotDR(sce, 
    dr = "UMAP", facet_by = "condition",
    color_by = c("pS6", "pBtk", "pNFkB"))